Measures of central tendency (location)

The value around which the distribution is centered

  1. Mean
  2. Median
  3. Mode (most common value)

Mean

Sample (arithmetic) mean:

\[\bar{Y} = \frac{\sum^n_{i=1}Y_i}{n}\]

The term “mean” is preferred to “average.” The arithmetic mean is one kind of average, but there are others (as there are other types of means).

  • One argument: “Average” is a statistic determined by an arithmetic procedure. “Mean” is a parameter.

Median

The median is the central measurement of a sample (the 50th percentile and the 0.50 quantile). If \(n\) is even, then the median is the mean of the two middle observations.

1:10
##  [1]  1  2  3  4  5  6  7  8  9 10
median(1:10)
## [1] 5.5
quantile(1:10, prob = 0.5)
## 50% 
## 5.5

Mean vs. Median

Number of lateral plates (plates) in threespine sticklebacks (Gasterosteus aculeatus) with three different Ectodysplasin genotypes (mm, Mm, and MM).

Mean vs. Median

glimpse(SticklebackPlates)
## Rows: 344
## Columns: 2
## $ genotype <chr> "mm", "Mm", "Mm", "Mm", "mm", "mm", "Mm", "Mm", "Mm", "Mm", …
## $ plates   <dbl> 11, 63, 22, 10, 14, 11, 58, 36, 31, 61, 63, 8, 12, 11, 64, 6…

Mean vs. Median

Mean vs. Median

SticklebackPlates %>% group_by(genotype) %>% 
  summarize(mean_plate = mean(plates),
            median_plate = median(plates),
            .groups = "drop")
## # A tibble: 3 x 3
##   genotype mean_plate median_plate
## * <chr>         <dbl>        <dbl>
## 1 mm             11.7           11
## 2 Mm             50.4           59
## 3 MM             62.8           63

Mean is sensitive to extreme values

When might you substitute the median for the mean?

(As a measure of central tendency)

Why don’t we always use the median?

Frameworks for inference

  1. Analytical
  2. Maximum likelihood
  3. Resampling
  4. Bayesian

Flying snake

Flying snake

Inferring a mean

Mean undulation rate for \(n = 8\) gliding snakes:

undulation_rate <- c(0.9, 1.2, 1.2, 1.3, 1.4, 1.4, 1.6, 2.0)

What is the mean undulation rate for this sample of flying snakes?

Undulation rate

Analytical inference of mean

Arithmetic mean:

\[\hat{Y} = \frac{\sum_{i=1}^{n}Y_i}{n}\]

\[mean~undulation~rate = \frac{\sum_{i=1}^{n}undulation~rate_i}{n}\]

Analytical inference of mean

sum(undulation_rate) / length(undulation_rate)
## [1] 1.375
mean(undulation_rate)
## [1] 1.375

Maximum likelihood inference of mean

Use dnorm() to calculate the relative likelihood of an observed value \(Y_i\) drawn from a normal distribution given a mean (\(\mu\)) and standard deviation (\(\sigma\)).

\[f\left(Y_i; \mu, \sigma\right) = \frac{1}{\sqrt{2\pi\sigma^{2}}} e^{\frac{-\left(Y_i-\mu\right)^{2}}{2\sigma^{2}}}\]

Standard normal distribution

dnorm(0, mean = 0, sd = 1)
## [1] 0.3989423

Calculating a likelihood

Hypothesizing that the population mean is 0 and the standard deviation is 1, what is the likelihood of the observed values?

  1. This is a model.
  2. Calculate the relative likelihood of each observation
  3. Model likelihood is the product of the individual likelihoods
  4. log-likelihood is more tractable, so calculate that

Model Likelihood (\(\mathcal{L}\))

For a set of observations (\(Y_i\)) and hypothesized parameters (\(\Theta\); i.e., mean and standard deviation) the model likelihood is the product of the observations’ individual likelihoods:

\[\mathcal{L}\left(\left\{ Y_{i}\right\} _{i=1}^{n};\Theta\right) = \prod_{i=1}^{n} Pr\left[Y_{i}; \Theta\right]\]

\[\log\left(\mathcal{L}\left(\left\{ Y_{i}\right\} _{i=1}^{n};\Theta\right)\right) = \sum_{i=1}^{n} \log\left(Pr\left[Y_{i};\Theta\right]\right)\]

Calculating the log-likelihood for a single combination of \(\mu\) and \(\sigma\)

Hypothesizing that the population mean is 0 and the standard deviation is 1, what is the likelihood of the observed values?

Likelihood for the first observation (undulation_rate[1]):

undulation_rate[1]
## [1] 0.9
dnorm(undulation_rate[1], mean = 0, sd = 1)
## [1] 0.2660852

Calculating the log-likelihood for a single combination of \(\mu\) and \(\sigma\)

This is only the likelihood for one observation. We need the likelihoods for all 8 undulation rates to get a model likelihood.

Calculating the log-likelihood for a single combination of \(\mu\) and \(\sigma\)

Vector of likelihoods for all values in undulation_rate given mu = 0 and sigma = 1:

(rel_liks <- dnorm(undulation_rate, mean = 0, sd = 1))
## [1] 0.26608525 0.19418605 0.19418605 0.17136859 0.14972747 0.14972747 0.11092083
## [8] 0.05399097

Calculating the log-likelihood for a single combination of \(\mu\) and \(\sigma\)

Model likelihood is the product of those likelihoods:

(lik <- prod(rel_liks))
## [1] 2.308476e-07

Likelihood to log-likelihood

log(lik)
## [1] -15.28151

Rather than logging the product, we can sum the log-likelihoods:

sum(log(rel_liks))
## [1] -15.28151

For a model in which the mean is 0 and standard deviation is 1, the model log-likelihood is -15.28.

Higher likelihood

Is there another combination of \(\mu\) and \(\sigma\) that gives a higher likelihood (= larger log-likelihood)?

Try \(\mu = 1\) and \(\sigma = 1\):

sum(log(dnorm(undulation_rate, mean = 1, sd = 1)))
## [1] -8.281508

This is an improvement over \(\mu = 0\) and \(\sigma = 1\).

Calculating the log-likelihood for a range of \(\mu\) and \(\sigma\)

Find the combination of \(\mu\) and \(\sigma\) that maximizes the log-likelihood of the model for the mean and standard deviation of undulation rates.

Ranges of possible values:

  1. Mean (\(\mu\)): \(-\infty < \mu < \infty\)
  2. Standard deviation (\(\sigma\)): \(0 < \sigma < \infty\)

Grid approximation

For combinations of \(\mu\) and \(\sigma\), calculate the model likelihood. Pick the largest log-likelihood as the parameter estimates.

Set up the grid:

n <- 100                           # How fine is the grid?
mu <- seq(0.1, 3, length = n)      # Vector of mu
sigma <- seq(0.1, 0.5, length = n) # Vector of sigma

grid_approx <- crossing(mu, sigma)

grid_approx
## # A tibble: 10,000 x 2
##       mu sigma
##    <dbl> <dbl>
##  1   0.1 0.1  
##  2   0.1 0.104
##  3   0.1 0.108
##  4   0.1 0.112
##  5   0.1 0.116
##  6   0.1 0.120
##  7   0.1 0.124
##  8   0.1 0.128
##  9   0.1 0.132
## 10   0.1 0.136
## # … with 9,990 more rows

Grid approximation

log_lik <- numeric(length = nrow(grid_approx))

for (ii in 1:nrow(grid_approx)) {
  log_lik[ii] <- 
    sum(dnorm(undulation_rate,
              mean = grid_approx$mu[ii],
              sd = grid_approx$sigma[ii],
              log = TRUE))
}
grid_approx$log_lik <- log_lik
  • Iterate through the rows (\(ii\)) of grid_approx
  • For each row, assign the model log-likelihood calculated for that mu and sigma to log_lik

grid_approx
## # A tibble: 10,000 x 3
##       mu sigma log_lik
##    <dbl> <dbl>   <dbl>
##  1   0.1 0.1     -676.
##  2   0.1 0.104   -624.
##  3   0.1 0.108   -578.
##  4   0.1 0.112   -536.
##  5   0.1 0.116   -499.
##  6   0.1 0.120   -466.
##  7   0.1 0.124   -436.
##  8   0.1 0.128   -408.
##  9   0.1 0.132   -384.
## 10   0.1 0.136   -361.
## # … with 9,990 more rows
  • For a 100 X 100 grid, there are 10,000 calculations.
  • If there were 3 parameters, there would be 1,000,000.

Visualizing the likelihood surface

Grid approximation

On this grid, the maximum likelihood estimates of \(\mu\) and \(\sigma\) are:

grid_approx[which.max(grid_approx$log_lik), ]
##            mu     sigma   log_lik
## 4451 1.388889 0.3020202 -1.810766

The analytical estimates are:

mean(undulation_rate)
## [1] 1.375
sd(undulation_rate)
## [1] 0.324037

Maximum likelihood via optimization

Search for the most likely values of \(\mu\) and \(\sigma\) across all possible values.

Maximum likelihood via optimization

Define a function that takes a vector of values to optimize x (\(\mu\) and \(\sigma\)) as well as a set of data Y and returns the log-likelihood:

log_lik <- function(x, Y){
  liks <- dnorm(Y, mean = x[1], sd = x[2], log = TRUE)
  return(sum(liks))
}

We can now simultaneously optimize \(\mu\) and \(\sigma\), maximizing the log-likelihood.

Maximum likelihood via optimization

reltol says to stop when the improvement is \(<10^{-100}\).

optim(c(0.1, 0.1), # Start at 0.1, 0.1
      log_lik,
      Y = undulation_rate,
      control = list(fnscale = -1,
                     reltol = 10^-100))
## $par
## [1] 1.3750000 0.3031089
## 
## $value
## [1] -1.802203
## 
## $counts
## function gradient 
##      175       NA 
## 
## $convergence
## [1] 0
## 
## $message
## NULL

Maximum likelihood via optimization

glm() fits generalized linear modules via optimization:

fm <- glm(undulation_rate ~ 1) # Estimate a mean only
coef(fm)
## (Intercept) 
##       1.375
logLik(fm)
## 'log Lik.' -1.802203 (df=2)

For a small enough tolerance, the maximum likelihood estimate equals the analytical estimate.